Abstract

We study fairness in classification, where individuals are classified, e.g., admitted to a university,
and the goal is to prevent discrimination against individuals based on their membership
in some group, while maintaining utility for the classifier (the university). The main conceptual 
contribution of this paper is a framework for fair classification comprising (1) a (hypothetical) 
task-specific metric for determining the degree to which individuals are similar with respect to the 
classification task at hand; (2) an algorithm for maximizing utility subject to the fairness constraint, 
that similar individuals are treated similarly. We also present an adaptation of our approach to 
achieve the complementary goal of "fair affirmative action," which guarantees statistical parity 
(i.e., the demographics of the set of individuals receiving any classification are the same as the 
demographics of the underlying population), while treating similar individuals as similarly as 
possible. Finally, we discuss the relationship of fairness to privacy: when fairness implies privacy, 
and how tools developed in the context of differential privacy may be applied to fairness.

1 Introduction



In this work, we study fairness in classification. Nearly all classification tasks face the challenge of 
achieving utility in classification for some purpose, while at the same time preventing discrimination 
against protected population subgroups. A motivating example is membership in a racial minority in 
the context of banking. An article in The Wall Street Journal (8/4/2010) describes the practices of a 
credit card company and its use of a tracking network to learn detailed demographic information about 
each visitor to the site, such as approximate income, where she shops, the fact that she rents children's 
videos, and so on. According to the article, this information is used to "decide which credit cards to 
show first-time visitors" to the web site, raising the concern of steering, namely the (illegal) practice of 
guiding members of minority groups into less advantageous credit offerings [SA10]. 

We provide a normative approach to fairness in classification and a framework for achieving it. 
Our framework permits us to formulate the question as an optimization problem that can be solved 
by a linear program. In keeping with the motivation of fairness in online advertising, our approach 
will permit the entity that needs to classify individuals, which we call the vendor, as much freedom as 
possible, without knowledge of or trust in this party. This allows the vendor to benefit from investment 
in data mining and market research in designing its classifier, while our absolute guarantee of fairness 
frees the vendor from regulatory concerns. 

Our approach is centered around the notion of a task-specific similarity metric describing the extent 
to which pairs of individuals should be regarded as similar for the classification task at hand.¹ The
similarity metric expresses ground truth. When ground truth is unavailable, the metric may reflect the
"best" available approximation as agreed upon by society. Following established tradition [Raw01], the
metric is assumed to be public and open to discussion and continual refinement. Indeed, we envision 
that, typically, the distance metric would be externally imposed, for example, by a regulatory body, or 
externally proposed, by a civil rights organization. 

The choice of a metric need not determine (or even suggest) a particular classification scheme. 
There can be many classifiers consistent with a single metric. Which classification scheme is chosen 
in the end is a matter of the vendor's utility function which we take into account. To give a concrete 
example, consider a metric that expresses which individuals have similar credit worthiness. One 
advertiser may wish to target a specific product to individuals with low credit, while another advertiser 
may seek individuals with good credit. 

1.1 Key Elements of Our Framework 

Treating similar individuals similarly. We capture fairness by the principle that any two individuals 
who are similar with respect to a particular task should be classified similarly. In order to accomplish 
this individual-based fairness, we assume a distance metric that defines the similarity between the 
individuals. This is the source of "awareness" in the title of this paper. We formalize this guiding 
principle as a Lipschitz condition on the classifier. In our approach a classifier is a randomized 
mapping from individuals to outcomes, or equivalently, a mapping from individuals to distributions 
over outcomes. The Lipschitz condition requires that any two individuals $x, y$ that are at distance
$d(x, y) \in [0, 1]$ map to distributions $M(x)$ and $M(y)$, respectively, such that the statistical distance
between $M(x)$ and $M(y)$ is at most $d(x, y)$. In other words, the distributions over outcomes observed by
$x$ and $y$ are indistinguishable up to their distance $d(x, y)$.

¹ Strictly speaking, we only require a function $d : V \times V \to \mathbb{R}$, where $V$ is the set of individuals,
with $d(x, y) \ge 0$, $d(x, y) = d(y, x)$ and $d(x, x) = 0$.

Formulation as an optimization problem. We consider the natural optimization problem of
constructing fair (i.e., Lipschitz) classifiers that minimize the expected utility loss of the vendor. We
observe that this optimization problem can be expressed as a linear program and hence solved
efficiently. Moreover, this linear program and its dual interpretation will be used heavily throughout our
work. 

Connection between individual fairness and group fairness. Statistical parity is the property 
that the demographics of those receiving positive (or negative) classifications are identical to the 
demographics of the population as a whole. Statistical parity speaks to group fairness rather than 
individual fairness, and appears desirable, as it equalizes outcomes across protected and non-protected 
groups. However, we demonstrate its inadequacy as a notion of fairness through several examples 
in which statistical parity is maintained, but from the point of view of an individual, the outcome 
is blatantly unfair. While statistical parity (or group fairness) is insufficient by itself, we investigate 
conditions under which our notion of fairness implies statistical parity. In Section 3, we give conditions 
on the similarity metric, via an Earthmover distance, such that fairness for individuals (the Lipschitz 
condition) yields group fairness (statistical parity). More precisely, we show that the Lipschitz 
condition implies statistical parity between two groups if and only if the Earthmover distance between 
the two groups is small. This characterization is an important tool in understanding the consequences 
of imposing the Lipschitz condition. 

Fair affirmative action. In Section 4, we give techniques for forcing statistical parity when it is 
not implied by the Lipschitz condition (the case of preferential treatment), while preserving as much 
fairness for individuals as possible. We interpret these results as providing a way of achieving fair 
affirmative action. 

A close relationship to privacy. We observe that our definition of fairness is a generalization of the 
notion of differential privacy [Dwo06, DMNS06]. We draw an analogy between individuals in the 
setting of fairness and databases in the setting of differential privacy. In Section 5 we build on this 
analogy and exploit techniques from differential privacy to develop a more efficient variation of our 
fairness mechanism. We prove that our solution has small error when the metric space of individuals 
has small doubling dimension, a natural condition arising in machine learning applications. We also 
prove a lower bound showing that any mapping satisfying the Lipschitz condition has error that scales 
with the doubling dimension. Interestingly, these results also demonstrate a quantitative trade-off
between fairness and utility. Finally, we touch on the extent to which fairness can hide information 
from the advertiser in the context of online advertising. 

Prevention of certain evils. We remark that our notion of fairness interdicts a catalogue of
discriminatory practices including the following, described in Appendix A: redlining; reverse redlining;
discrimination based on redundant encodings of membership in the protected set; cutting off business
with a segment of the population in which membership in the protected set is disproportionately high;
doing business with the "wrong" subset of the protected set (possibly in order to prove a point); and
"reverse tokenism."

1.2 Discussion: The Metric



As noted above, the metric should (ideally) capture ground truth. Justifying the availability of or access 
to the distance metric in various settings is one of the most challenging aspects of our framework, and 
in reality the metric used will most likely only be society's current best approximation to the truth. Of 
course, metrics are employed, implicitly or explicitly, in many classification settings, such as college 
admissions procedures, advertising ("people who buy X and live in zipcode Y are similar to people 
who live in zipcode Z and buy W"), and loan applications (credit scores). Our work advocates for 
making these metrics public. 

An intriguing example of an existing metric designed for the health care setting is part of the 
AALIM project [AAL], whose goal is to provide a decision support system for cardiology that helps 
a physician in finding a suitable diagnosis for a patient based on the consensus opinions of other 
physicians who have looked at similar patients in the past. Thus the system requires an accurate 
understanding of which patients are similar based on information from multiple domains such as 
cardiac echo videos, heart sounds, ECGs and physicians' reports. AALIM seeks to ensure that 
individuals with similar health characteristics receive similar treatments from physicians. This work 
could serve as a starting point in the fairness setting, although it does not (yet?) provide the distance 
metric that our approach requires. We discuss this further in Section 6.1. 

Finally, we can envision classification situations in which it is desirable to "adjust" or otherwise 
"make up" a metric, and use this synthesized metric as a basis for determining which pairs of individuals 
should be classified similarly.² Our machinery is agnostic as to the "correctness" of the metric, and so
can be employed in these settings as well. 

1.3 Related Work 

There is a broad literature on fairness, notably in social choice theory, game theory, economics, and law. 
Among the most relevant are theories of fairness and algorithmic approaches to apportionment; see, for 
example, the following books: H. Peyton Young's Equity, John Roemer's Equality of Opportunity and 
Theories of Distributive Justice, as well as John Rawls' A Theory of Justice and Justice as Fairness: A 
Restatement. Calsamiglia [Cal05] explains, 

"Equality of opportunity defines an important welfare criterion in political philosophy 
and policy analysis. Philosophers define equality of opportunity as the requirement 
that an individual's well being be independent of his or her irrelevant characteristics. 
The difference among philosophers is mainly about which characteristics should be 
considered irrelevant. Policymakers, however, are often called upon to address more 
specific questions: How should admissions policies be designed so as to provide equal 
opportunities for college? Or how should tax schemes be designed so as to equalize 
opportunities for income? These are called local distributive justice problems, because 
each policymaker is in charge of achieving equality of opportunity to a specific issue." 

In general, local solutions do not, taken together, solve the global problem: "There is no mechanism 
comparable to the invisible hand of the market for coordinating distributive justice at the micro 
into just outcomes at the macro level" [You95], (although Calsamiglia's work treats exactly this 
problem [Cal05]). Nonetheless, our work is decidedly "local," both in the aforementioned sense and in
our definition of fairness. To our knowledge, our approach differs from much of the literature in our
fundamental skepticism regarding the vendor; we address this by separating the vendor from the data
owner, leaving classification to the latter.

² This is consistent with the practice, in some college admissions offices, of adding a certain number of points to SAT
scores of students in disadvantaged groups.

Concerns for "fairness" also arise in many contexts in computer science, game theory, and 
economics. For example, in the distributed computing literature, one meaning of fairness is that a 
process that attempts infinitely often to make progress eventually makes progress. One quantitative 
meaning of unfairness in scheduling theory is the maximum, taken over all members of a set of long- 
lived processes, of the difference between the actual load on the process and the so-called desired load 
(the desired load is a function of the tasks in which the process participates) [AAN+98]; other notions
of fairness appear in [BS06, Fei08, FT11], to name a few. For an example of work incorporating
fairness into game theory and economics see the eponymous paper [Rab93].

2 Formulation of the Problem 

In this section we describe our setup in its most basic form. We shall later see generalizations of this 
basic formulation. Individuals are the objects to be classified; we denote the set of individuals by $V$. In
this paper we consider classifiers that map individuals to outcomes. We denote the set of outcomes
by $A$. In the simplest non-trivial case $A = \{0, 1\}$. To ensure fairness, we will consider randomized
classifiers mapping individuals to distributions over outcomes. To introduce our notion of fairness
we assume the existence of a metric on individuals $d : V \times V \to \mathbb{R}$. We will consider randomized
mappings $M : V \to \Delta(A)$ from individuals to probability distributions over outcomes. Such a mapping
naturally describes a randomized classification procedure: to classify $x \in V$ choose an outcome $a$
according to the distribution $M(x)$. We interpret the goal of "mapping similar people similarly" to
mean that the distributions assigned to similar people are similar. Later we will discuss two specific
measures of similarity of distributions, $D_\infty$ and $D_{tv}$, of interest in this work.

Definition 2.1 (Lipschitz mapping). A mapping $M : V \to \Delta(A)$ satisfies the $(D, d)$-Lipschitz property
if for every $x, y \in V$, we have

$$D(Mx, My) \le d(x, y). \qquad (1)$$

When $D$ and $d$ are clear from the context we will refer to this simply as the Lipschitz property.
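
As a minimal sketch of Definition 2.1 (an illustration, not part of the formal development), the following Python snippet checks the $(D_{tv}, d)$-Lipschitz property on a finite set of individuals. The three individuals, their positions on a line, and the helper names `total_variation` and `is_lipschitz` are hypothetical and chosen only for this example.

```python
from itertools import combinations

def total_variation(p, q):
    """Statistical distance between two distributions given as dicts outcome -> probability."""
    outcomes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in outcomes)

def is_lipschitz(mapping, d, dist=total_variation, tol=1e-12):
    """True if dist(M(x), M(y)) <= d(x, y) for every pair of individuals."""
    return all(
        dist(mapping[x], mapping[y]) <= d(x, y) + tol
        for x, y in combinations(mapping, 2)
    )

# Hypothetical toy instance: individuals sit on a line and d is the gap between positions.
position = {"alice": 0.0, "bob": 0.1, "carol": 0.9}

def d(x, y):
    return abs(position[x] - position[y])

fair = {
    "alice": {0: 0.2, 1: 0.8},
    "bob": {0: 0.25, 1: 0.75},
    "carol": {0: 0.9, 1: 0.1},
}
# Same mapping, except that bob (close to alice) is now treated very differently.
unfair = dict(fair, bob={0: 0.6, 1: 0.4})

print(is_lipschitz(fair, d), is_lipschitz(unfair, d))
```

The second mapping fails only because of the pair (alice, bob): their distributions differ by 0.4 in statistical distance while their metric distance is 0.1.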

We note that there always exists a Lipschitz classifier, for example, by mapping all individuals 
to the same distribution over A. Which classifier we shall choose thus depends on a notion of utility. 
We capture utility using a loss function $L : V \times A \to \mathbb{R}$. This setup naturally leads to the optimization
problem: 

Find a mapping from individuals to distributions over outcomes that minimizes expected 
loss subject to the Lipschitz condition. 

2.1 Achieving Fairness 

Our fairness definition leads to an optimization problem in which we minimize an arbitrary loss function
$L : V \times A \to \mathbb{R}$ while achieving the $(D, d)$-Lipschitz property for a given metric $d : V \times V \to \mathbb{R}$. We
denote by $I$ an instance of our problem consisting of a metric $d : V \times V \to \mathbb{R}$ and a loss function
$L : V \times A \to \mathbb{R}$. We denote the optimal value of the minimization problem by $\mathrm{opt}(I)$, as formally defined
in Figure 1. We will also write the mapping $M : V \to \Delta(A)$ as $M = \{\mu_x\}_{x \in V}$ where $\mu_x = M(x) \in \Delta(A)$.

$$\mathrm{opt}(I) \stackrel{\mathrm{def}}{=} \min_{\{\mu_x\}_{x \in V}} \; \mathbb{E}_{x \sim V} \, \mathbb{E}_{a \sim \mu_x} L(x, a) \qquad (2)$$

$$\text{subject to} \quad \forall x, y \in V : \; D(\mu_x, \mu_y) \le d(x, y) \qquad (3)$$

$$\forall x \in V : \; \mu_x \in \Delta(A) \qquad (4)$$

Figure 1: The Fairness LP: Loss minimization subject to fairness constraint
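
For intuition only, the tiny instance below replaces the linear program by a brute-force grid search. This is feasible only for a handful of individuals with binary outcomes and is not the method proposed here. The instance (two individuals with opposite preferences at distance 3/10), the uniform weights, and the function names are assumptions made for the example.

```python
from fractions import Fraction
from itertools import product

def expected_loss(p, loss, weight):
    """Objective (2) for A = {0, 1}; p[x] is the probability that x receives outcome 0."""
    return sum(weight[x] * (p[x] * loss[x][0] + (1 - p[x]) * loss[x][1]) for x in p)

def grid_search(individuals, loss, weight, d, steps=10):
    """Brute-force stand-in for the Fairness LP with D = D_tv (illustration only)."""
    grid = [Fraction(k, steps) for k in range(steps + 1)]
    best_value, best_p = None, None
    for values in product(grid, repeat=len(individuals)):
        p = dict(zip(individuals, values))
        # For binary outcomes, D_tv(mu_x, mu_y) equals |p[x] - p[y]|.
        feasible = all(
            abs(p[x] - p[y]) <= d[x][y] for x in individuals for y in individuals
        )
        if feasible:
            value = expected_loss(p, loss, weight)
            if best_value is None or value < best_value:
                best_value, best_p = value, p
    return best_value, best_p

# Hypothetical instance: the vendor prefers outcome 0 for "x" and outcome 1 for "y".
individuals = ["x", "y"]
weight = {"x": Fraction(1, 2), "y": Fraction(1, 2)}
loss = {"x": {0: 0, 1: 1}, "y": {0: 1, 1: 0}}
d = {"x": {"x": 0, "y": Fraction(3, 10)}, "y": {"x": Fraction(3, 10), "y": 0}}

best_value, best_p = grid_search(individuals, loss, weight, d)
print(best_value)  # 7/20
```

Without the Lipschitz constraint the vendor could reach loss 0 in this instance; the constraint keeps the two distributions within 3/10 of each other, which costs 7/20 in expected loss.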

Probability Metrics. The first choice for $D$ that may come to mind is the statistical distance: Let
$P, Q$ denote probability measures on a finite domain $A$. The statistical distance or total variation norm
between $P$ and $Q$ is denoted by

$$D_{tv}(P, Q) = \frac{1}{2} \sum_{a \in A} |P(a) - Q(a)|. \qquad (5)$$

The following lemma is easily derived from the definitions of $\mathrm{opt}(I)$ and $D_{tv}$.

Lemma 2.1. Let $D = D_{tv}$. Given an instance $I$ we can compute $\mathrm{opt}(I)$ with a linear program of size
$\mathrm{poly}(|V|, |A|)$.

Remark 2.1. When dealing with the set V, we have assumed that V is the set of real individuals (rather 
than the potentially huge set of all possible encodings of individuals). More generally, we may only 
have access to a subsample from the set of interest. In such a case, there is the additional challenge of 
extrapolating a classifier over the entire set. 

A weakness of using $D_{tv}$ as the distance measure on distributions is that we should then assume that
the distance metric (measuring distance between individuals) is scaled such that for similar individuals
$d(x, y)$ is very close to zero, while for very dissimilar individuals $d(x, y)$ is close to one. A potentially
better choice for $D$ in this respect is sometimes called the relative $\ell_\infty$ metric:

$$D_\infty(P, Q) = \sup_{a \in A} \log\left( \max\left\{ \frac{P(a)}{Q(a)}, \frac{Q(a)}{P(a)} \right\} \right). \qquad (6)$$

With this choice we think of two individuals $x, y$ as similar if $d(x, y) \ll 1$. In this case, the Lipschitz
condition in Equation 1 ensures that $x$ and $y$ map to similar distributions over $A$. On the other hand,
when $x, y$ are very dissimilar, i.e., $d(x, y) \gg 1$, the condition imposes only a weak constraint on the
two corresponding distributions over outcomes.

Lemma 2.2. Let $D = D_\infty$. Given an instance $I$ we can compute $\mathrm{opt}(I)$ with a linear program of size
$\mathrm{poly}(|V|, |A|)$.

Proof. We note that the objective function and the first constraint are indeed linear in the
variables $\mu_x(a)$, as the first constraint boils down to requirements of the form $\mu_x(a) \le e^{d(x, y)} \mu_y(a)$. The
second constraint $\mu_x \in \Delta(A)$ can easily be rewritten as a set of linear constraints. □

Notation. Recall that we often write the mapping $M : V \to \Delta(A)$ as $M = \{\mu_x\}_{x \in V}$ where
$\mu_x = M(x) \in \Delta(A)$. In this case, when $S$ is a distribution over $V$ we denote by $\mu_S$ the distribution
over $A$ defined as $\mu_S(a) = \mathbb{E}_{x \sim S} \, \mu_x(a)$ where $a \in A$.

Useful Facts. It is not hard to check that both $D_{tv}$ and $D_\infty$ are metrics with the following properties.

Lemma 2.3. $D_{tv}(P, Q) \le 1 - \exp(-D_\infty(P, Q)) \le D_\infty(P, Q)$.

Fact 2.1. For any three distributions $P, Q, R$ and non-negative numbers $\alpha, \beta \ge 0$ such that $\alpha + \beta = 1$,
we have $D_{tv}(\alpha P + \beta Q, R) \le \alpha D_{tv}(P, R) + \beta D_{tv}(Q, R)$.
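
A quick numerical sanity check of Lemma 2.3 and Fact 2.1 can be sketched as follows. The distributions `P`, `Q`, `R` and the weights are arbitrary values picked for illustration, and `d_inf` assumes strictly positive probabilities so that the logarithm is defined.

```python
import math

def d_tv(p, q):
    """Statistical distance between two distributions given as aligned lists."""
    return 0.5 * sum(abs(pa - qa) for pa, qa in zip(p, q))

def d_inf(p, q):
    """Relative l_infinity metric; log max{P/Q, Q/P} equals |log(P/Q)|."""
    return max(abs(math.log(pa / qa)) for pa, qa in zip(p, q))

# Arbitrary example distributions over three outcomes.
P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]
R = [0.2, 0.2, 0.6]

tv, inf = d_tv(P, Q), d_inf(P, Q)
middle = 1 - math.exp(-inf)           # the middle term of Lemma 2.3
chain_holds = tv <= middle <= inf

# Fact 2.1 for one choice of weights (a small tolerance absorbs rounding).
alpha, beta = 0.3, 0.7
mixture = [alpha * p + beta * q for p, q in zip(P, Q)]
convex_holds = d_tv(mixture, R) <= alpha * d_tv(P, R) + beta * d_tv(Q, R) + 1e-12

print(chain_holds, convex_holds)
```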

Post-Processing. An important feature of our definition is that it behaves well with respect to
post-processing. Specifically, if $M : V \to \Delta(A)$ is $(D, d)$-Lipschitz for $D \in \{D_{tv}, D_\infty\}$ and $f : A \to B$ is
any possibly randomized function from $A$ to another set $B$, then the composition $f \circ M : V \to \Delta(B)$
is a $(D, d)$-Lipschitz mapping. This would in particular be useful in the setting of the example in
Section 2.2.
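
The post-processing property can be illustrated for $D_{tv}$ with a small sketch: a randomized function is represented as a table of conditional distributions, and the statistical distance is compared before and after applying it. The outcome names, the table `f`, and the two distributions are hypothetical.

```python
def d_tv(p, q):
    """Statistical distance between two distributions given as dicts."""
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

def post_process(mu, f):
    """Push a distribution mu over A through a randomized map f: A -> Delta(B)."""
    out = {}
    for a, pa in mu.items():
        for b, fb in f[a].items():
            out[b] = out.get(b, 0.0) + pa * fb
    return out

# Hypothetical randomized map from ad categories to a show/hide decision.
f = {
    "gold": {"show": 0.9, "hide": 0.1},
    "silver": {"show": 0.5, "hide": 0.5},
    "basic": {"show": 0.2, "hide": 0.8},
}
mu_x = {"gold": 0.6, "silver": 0.3, "basic": 0.1}
mu_y = {"gold": 0.2, "silver": 0.3, "basic": 0.5}

before = d_tv(mu_x, mu_y)
after = d_tv(post_process(mu_x, f), post_process(mu_y, f))
print(before, after)
```

In this example the distance shrinks from about 0.4 to about 0.28, so any bound $d(x, y)$ that held before post-processing still holds afterwards.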

2.2 Example: Ad network 

Here we expand on the example of an advertising network mentioned in the Introduction. We explain 
how the Fairness LP provides a fair solution protecting against the evils described in Appendix A. The 
Wall Street Journal article [SA10] describes how the [x+1] tracking network collects demographic 
information about individuals, such as their browsing history, geographical location, and shopping 
behavior, and utilizes this to assign a person to one of 66 groups. For example, one of these groups is 
"White Picket Fences," a market segment with median household income of just over $50,000, aged 25 
to 44 with kids, with some college education, etc. Based on this assignment to a group, CapitalOne 
decides which credit card, with particular terms of credit, to show the individual. In general we view a 
classification task as involving two distinct parties: the data owner is a trusted party holding the data 
of individuals, and the vendor is the party that wishes to classify individuals. The loss function may be 
defined solely by either party or by both parties in collaboration. In this example, the data owner is the 
ad network [x+1], and the vendor is CapitalOne. 

The ad network ([x+1]) maintains a mapping from individuals into categories. We can think of 
these categories as outcomes, as they determine which ads will be shown to an individual. In order to 
comply with our fairness requirement, the mapping from individuals into categories (or outcomes) will 
have to be randomized and satisfy the Lipschitz property introduced above. Subject to the Lipschitz 
constraint, the vendor can still express its own belief as to how individuals should be assigned to 
categories using the loss function. However, since the Lipschitz condition is a hard constraint there is 
no possibility of discriminating between individuals that are deemed similar by the metric. In particular, 
this will disallow arbitrary distinctions between protected individuals, thus preventing both reverse 
tokenism and the self-fulfilling prophecy (see Appendix A). In addition, the metric can eliminate the 
existence of redundant encodings of certain attributes thus also preventing redlining of those attributes. 
In Section 3 we will see a characterization of which attributes are protected by the metric in this way. 

2.3 Connection to Differential Privacy 

Our notion of fairness may be viewed as a generalization of differential privacy [Dwo06, DMNS06].
To see this, consider
a simple setting of differential privacy where a database curator maintains a database $x$ (thought
of as a subset of some universe $U$) and a data analyst is allowed to ask a query $F : V \to A$ on the
database. Here we denote the set of databases by $V = 2^U$ and the range of the query by $A$. A mapping
$M : V \to \Delta(A)$ satisfies $\epsilon$-differential privacy if and only if $M$ satisfies the $(D_\infty, d)$-Lipschitz property,
where, letting $x \triangle y$ denote the symmetric difference between $x$ and $y$, we define $d(x, y) \stackrel{\mathrm{def}}{=} \epsilon |x \triangle y|$.

The utility loss of the analyst for getting an answer $a \in A$ from the mechanism is defined
as $L(x, a) = d_A(Fx, a)$, that is, the distance of the true answer from the given answer. Here distance
refers to some distance measure in $A$ that we described using the notation $d_A$. For example, when
$A = \mathbb{R}$, this could simply be $d_A(a, b) = |a - b|$. The optimization problem (2) in Figure 1 (i.e.,
$\mathrm{opt}(I) \stackrel{\mathrm{def}}{=} \min \mathbb{E}_{x \sim V} \mathbb{E}_{a \sim \mu_x} L(x, a)$) now defines the optimal differentially private mechanism in this
setting. We can draw a conceptual analogy between the utility model in differential privacy and that 
in fairness. If we think of outcomes as representing information about an individual, then the vendor 
wishes to receive what she believes is the most "accurate" representation of an individual. This is quite 
similar to the goal of the analyst in differential privacy. 
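
To make the correspondence concrete, the sketch below takes a small universe and a mechanism that flips each membership bit independently (per-element randomized response, a standard differentially private mechanism chosen here purely as an example) and verifies the $(D_\infty, d)$-Lipschitz property with $d(x, y) = \epsilon |x \triangle y|$ by enumeration. The universe, the value of `EPS`, and all names are assumptions of the example.

```python
import math
from itertools import combinations

EPS = 0.5
UNIVERSE = ("u1", "u2", "u3")
KEEP = math.exp(EPS) / (1 + math.exp(EPS))  # probability of reporting a bit truthfully

def subsets(universe):
    """All databases, i.e. all subsets of the universe."""
    return [frozenset(c) for r in range(len(universe) + 1) for c in combinations(universe, r)]

def mechanism(x):
    """Distribution over reported subsets when each membership bit is flipped independently."""
    dist = {}
    for o in subsets(UNIVERSE):
        agree = sum((u in o) == (u in x) for u in UNIVERSE)
        dist[o] = KEEP ** agree * (1 - KEEP) ** (len(UNIVERSE) - agree)
    return dist

def d_inf(p, q):
    """Relative l_infinity metric between two fully supported distributions."""
    return max(abs(math.log(p[o] / q[o])) for o in p)

databases = subsets(UNIVERSE)
# Largest deviation between D_inf(M(x), M(y)) and EPS * |x symmetric-difference y|.
max_gap = max(
    abs(EPS * len(x ^ y) - d_inf(mechanism(x), mechanism(y)))
    for x in databases
    for y in databases
)
print(max_gap < 1e-9)
```

For this particular mechanism the Lipschitz bound is met with equality on every pair of databases, which is why the check compares the two quantities up to rounding error.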

In the current work we deal with more general metric spaces than in differential privacy.
Nevertheless, we later see (specifically in Section 5) that some of the techniques used in differential privacy
carry over to the fairness setting. 

3 Relationship between Lipschitz property and statistical parity 

In this section we discuss the relationship between the Lipschitz property articulated in Definition 2.1 
and statistical parity. As we discussed earlier, statistical parity is insufficient as a general notion 
of fairness. Nevertheless statistical parity can have several desirable features, e.g., as described in 
Proposition 3.1 below. In this section we demonstrate that the Lipschitz condition naturally implies 
statistical parity between certain subsets of the population. 
Formally, statistical parity is the following property. 

Definition 3.1 (Statistical parity). We say that a mapping $M : V \to \Delta(A)$ satisfies statistical parity
between distributions $S$ and $T$ up to bias $\epsilon$ if

$$D_{tv}(\mu_S, \mu_T) \le \epsilon. \qquad (7)$$

Proposition 3.1. Let $M : V \to \Delta(A)$ be a mapping that satisfies statistical parity between two sets $S$
and $T$ up to bias $\epsilon$. Then, for every set of outcomes $O \subseteq A$, we have the following two properties.

1. $|\Pr\{M(x) \in O \mid x \in S\} - \Pr\{M(x) \in O \mid x \in T\}| \le \epsilon$,

2. $|\Pr\{x \in S \mid M(x) \in O\} - \Pr\{x \in T \mid M(x) \in O\}| \le \epsilon$.

Intuitively, this proposition says that if $M$ satisfies statistical parity, then members of $S$ are equally
likely to observe a set of outcomes as are members of $T$. Furthermore, the fact that an individual
observed a particular outcome provides no information as to whether the individual is a member of $S$ or
a member of $T$. We can always choose $T = S^c$ in which case we compare $S$ to the general population.
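
As a small illustration of Definition 3.1 and the first property of Proposition 3.1, the following sketch computes $\mu_S$ and $\mu_T$ for a toy mapping and compares $D_{tv}(\mu_S, \mu_T)$ with the largest gap over all outcome sets $O$. The four individuals and their outcome distributions are made up for the example.

```python
from itertools import combinations

def mix(mapping, group):
    """mu_S(a) = E_{x ~ S} mu_x(a) for a distribution S over individuals."""
    out = {}
    for x, weight in group.items():
        for a, pa in mapping[x].items():
            out[a] = out.get(a, 0.0) + weight * pa
    return out

def d_tv(p, q):
    """Statistical distance between two distributions given as dicts."""
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in set(p) | set(q))

# Hypothetical mapping: outcome 1 means "shown the ad".
mapping = {
    "s1": {1: 0.75, 0: 0.25},
    "s2": {1: 0.125, 0: 0.875},
    "t1": {1: 0.5, 0: 0.5},
    "t2": {1: 0.5, 0: 0.5},
}
S = {"s1": 0.5, "s2": 0.5}
T = {"t1": 0.5, "t2": 0.5}

mu_S, mu_T = mix(mapping, S), mix(mapping, T)
parity_bias = d_tv(mu_S, mu_T)

# Largest difference |mu_S(O) - mu_T(O)| over all sets of outcomes O.
outcomes = sorted(mu_S)
worst_set_gap = max(
    abs(sum(mu_S[a] - mu_T[a] for a in O))
    for r in range(len(outcomes) + 1)
    for O in combinations(outcomes, r)
)
print(parity_bias, worst_set_gap)
```

Both quantities come out as 0.0625 here. Note that the two members of $S$ are treated very differently from each other even though the group-level bias is small.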

3.1 Why is statistical parity insufficient? 

Although in some cases statistical parity appears to be desirable - in particular, it neutralizes redundant 
encodings - we now argue its inadequacy as a notion of fairness, presenting three examples in which 
statistical parity is maintained, but from the point of view of an individual, the outcome is blatantly 
unfair. In describing these examples, we let $S$ denote the protected set and $S^c$ its complement.

Example 1: Reduced Utility. Consider the following scenario. Suppose in the culture of $S$ the most
talented students are steered toward science and engineering and the less talented are steered
toward finance, while in the culture of $S^c$ the situation is reversed: the most talented are steered
toward finance and those with less talent are steered toward engineering. An organization
ignorant of the culture of $S$ and seeking the most talented people may select for "economics,"
arguably choosing the wrong subset of $S$, even while maintaining parity. Note that this poor
outcome can occur in a "fairness through blindness" approach - the errors come from ignoring
membership in $S$.

Example 2: Self-fulfilling prophecy. This is when unqualified members of S are chosen, in order 
to "justify" future discrimination against S (building a case that there is no point in "wasting" 
resources on S). Although senseless, this is an example of something pernicious that is not ruled 
out by statistical parity, showing the weakness of this notion. A variant of this apparently occurs 
in selecting candidates for interviews: the hiring practices of certain firms are audited to ensure 
sufficiently many interviews of minority candidates, but less care is taken to ensure that the best 
minorities - those that might actually compete well with the better non-minority candidates - 
are invited [Zar11].

Example 3: Subset Targeting. Statistical parity for $S$ does not imply statistical parity for subsets of
$S$. This can be maliciously exploited in many ways. For example, consider an advertisement
for a product $X$ which is targeted to members of $S$ that are likely to be interested in $X$ and to
members of $S^c$ that are very unlikely to be interested in $X$. Clicking on such an ad may be
strongly correlated with membership in $S$ (even if exposure to the ad obeys statistical parity).

3.2 Earthmover distance: Lipschitz versus statistical parity 

A fundamental question that arises in our approach is: When does the Lipschitz condition imply 
statistical parity between two distributions S and T on V? We will see that the answer to this question 
is closely related to the Earthmover distance between S and T, which we will define shortly. 

The next definition formally introduces the quantity that we will study, that is, the extent to which 
any Lipschitz mapping can violate statistical parity. In other words, we answer the question, "How 
biased with respect to S and T might the solution of the fairness LP be, in the worst case?" 

Definition 3.2 (Bias). We define

$$\mathrm{bias}_{D,d}(S, T) \stackrel{\mathrm{def}}{=} \max \; \mu_S(0) - \mu_T(0), \qquad (8)$$

where the maximum is taken over all $(D, d)$-Lipschitz mappings $M = \{\mu_x\}_{x \in V}$ mapping $V$ into $\Delta(\{0, 1\})$.

Note that $\mathrm{bias}_{D,d}(S, T) \in [0, 1]$. Even though in the definition we restricted ourselves to mappings
into distributions over $\{0, 1\}$, it turns out that this is without loss of generality, as we show next.

Lemma 3.1. Let $D \in \{D_{tv}, D_\infty\}$ and let $M : V \to \Delta(A)$ be any $(D, d)$-Lipschitz mapping. Then, $M$
satisfies statistical parity between $S$ and $T$ up to $\mathrm{bias}_{D,d}(S, T)$.

Proof. Let $M = \{\mu_x\}_{x \in V}$ be any $(D, d)$-Lipschitz mapping into $A$. We will construct a $(D, d)$-Lipschitz
mapping $M' : V \to \Delta(\{0, 1\})$ which has the same bias between $S$ and $T$ as $M$.
Indeed, let $A_S = \{a \in A : \mu_S(a) > \mu_T(a)\}$ and let $A_T = A_S^c$. Put $\mu'_x(0) = \mu_x(A_S)$ and $\mu'_x(1) = \mu_x(A_T)$.
We claim that $M' = \{\mu'_x\}_{x \in V}$ is a $(D, d)$-Lipschitz mapping. In both cases $D \in \{D_{tv}, D_\infty\}$ this follows
directly from the definition. On the other hand, it is easy to see that

$$D_{tv}(\mu_S, \mu_T) = D_{tv}(\mu'_S, \mu'_T) = \mu'_S(0) - \mu'_T(0) \le \mathrm{bias}_{D,d}(S, T).$$ □
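
The reduction to binary outcomes used in this proof can be traced on a concrete example. The sketch below collapses a three-outcome mapping onto $\{0, 1\}$ via the set $A_S$ and confirms that the statistical distance between the group distributions is unchanged; the mapping, the groups, and the function names are hypothetical.

```python
def mix(mapping, group):
    """mu_S(a) = E_{x ~ S} mu_x(a) for a distribution S over individuals."""
    out = {}
    for x, weight in group.items():
        for a, pa in mapping[x].items():
            out[a] = out.get(a, 0.0) + weight * pa
    return out

def d_tv(p, q):
    """Statistical distance between two distributions given as dicts."""
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in set(p) | set(q))

def collapse(mapping, S, T):
    """Build M' with mu'_x(0) = mu_x(A_S) and mu'_x(1) = mu_x(A_T), as in the proof."""
    mu_S, mu_T = mix(mapping, S), mix(mapping, T)
    A_S = {a for a in mu_S if mu_S[a] > mu_T.get(a, 0.0)}
    return {
        x: {
            0: sum(p for a, p in mu.items() if a in A_S),
            1: sum(p for a, p in mu.items() if a not in A_S),
        }
        for x, mu in mapping.items()
    }

# Hypothetical mapping into three outcomes "a", "b", "c".
mapping = {
    "x1": {"a": 0.5, "b": 0.25, "c": 0.25},
    "x2": {"a": 0.25, "b": 0.5, "c": 0.25},
    "x3": {"a": 0.125, "b": 0.125, "c": 0.75},
}
S = {"x1": 1.0}
T = {"x2": 0.5, "x3": 0.5}

binary = collapse(mapping, S, T)
before = d_tv(mix(mapping, S), mix(mapping, T))
after = d_tv(mix(binary, S), mix(binary, T))
gap = mix(binary, S)[0] - mix(binary, T)[0]
print(before, after, gap)
```

All three printed values agree (0.3125 in this instance), matching the chain of equalities at the end of the proof.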

Earthmover Distance. We will presently relate $\mathrm{bias}_{D,d}(S, T)$ for $D \in \{D_{tv}, D_\infty\}$ to certain
Earthmover distances between $S$ and $T$, which we define next.

Definition 3.3 (Earthmover distance). Let $\sigma : V \times V \to \mathbb{R}$ be a nonnegative distance function. The
$\sigma$-Earthmover distance between two distributions $S$ and $T$, denoted $\sigma_{EM}(S, T)$, is defined as the value
of the so-called Earthmover LP:

$$\sigma_{EM}(S, T) \stackrel{\mathrm{def}}{=} \min \sum_{x, y \in V} h(x, y)\, \sigma(x, y)$$

$$\text{subject to} \quad \sum_{y \in V} h(x, y) = S(x)$$

$$\sum_{y \in V} h(y, x) = T(x)$$

$$h(x, y) \ge 0$$

We will need the following standard lemma, which simplifies the definition of the Earthmover
distance in the case where $\sigma$ is a metric.

Lemma 3.2. Let $d : V \times V \to \mathbb{R}$ be a metric. Then,

$$d_{EM}(S, T) = \min \sum_{x, y \in V} h(x, y)\, d(x, y)$$

$$\text{subject to} \quad \sum_{y \in V} h(x, y) = \sum_{y \in V} h(y, x) + S(x) - T(x)$$

$$h(x, y) \ge 0$$

Theorem 3.3. Let $d$ be a metric. Then,

$$\mathrm{bias}_{D_{tv},d}(S, T) \le d_{EM}(S, T). \qquad (9)$$

If furthermore $d(x, y) \le 1$ for all $x, y$, then we have

$$\mathrm{bias}_{D_{tv},d}(S, T) \ge d_{EM}(S, T). \qquad (10)$$

Proof. The proof is by linear programming duality. We can express biaso tv ,d(S, T) as the following 
linear program: 

bias(S ,T) - max £ S (x)n x (0) - ^ T(x)» x (0) 

xeV xeV 

subject to ju x (0) - < d(x, y) 

lu x (0)+MV = 1 
j«x(a) > 
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Here, we used the fact that 



D u (ji x ,ii y ) < d(x, y) <=^> |^(0) - fi y (0)\ < d(x, y) . 

The constraint on the RHS is enforced in the linear program above by the two constraints /i Y (0)-/i y (0) < 
d(x, y) and fi y (0) - fi x (0) < d(x, y). 

We can now prove (9). Since d is a metric, we can apply Lemma 3.2. Let {f{x, y)} x ,yev be a 
solution to the LP defined in Lemma 3.2. By putting e x = for all x e V, we can extend this to a 
feasible solution to the LP defining bias(5\ T) achieving the same objective value. Hence, we have 
bias(,S, T) < d^wiiS, T). 

Let us now prove (10), using the assumption that $d(x,y) \le 1$. To do so, consider dropping the
constraint that $\mu_x(0) + \mu_x(1) = 1$ and denote by $\beta(S,T)$ the resulting LP:

\[
\begin{aligned}
\beta(S,T) \overset{\mathrm{def}}{=} \max\ & \sum_{x \in V} S(x)\mu_x(0) - \sum_{x \in V} T(x)\mu_x(0) \\
\text{subject to}\ & \mu_x(0) - \mu_y(0) \le d(x,y) \\
& \mu_x(0) \ge 0
\end{aligned}
\]

It is clear that $\beta(S,T) \ge \mathrm{bias}(S,T)$ and we claim that in fact $\mathrm{bias}(S,T) \ge \beta(S,T)$. To see this,
consider any solution $\{\mu_x(0)\}_{x \in V}$ to $\beta(S,T)$. Without changing the objective value we may assume
that $\min_{x \in V} \mu_x(0) = 0$. By our assumption that $d(x,y) \le 1$ this means that $\max_{x \in V} \mu_x(0) \le 1$. Now put
$\mu_x(1) = 1 - \mu_x(0) \in [0,1]$. This gives a solution to $\mathrm{bias}(S,T)$ achieving the same objective value. We
therefore have

\[ \mathrm{bias}(S,T) = \beta(S,T). \]
On the other hand, by strong LP duality, we have

\[
\begin{aligned}
\beta(S,T) = \min\ & \sum_{x,y \in V} h(x,y)\, d(x,y) \\
\text{subject to}\ & \sum_{y \in V} h(x,y) \ge \sum_{y \in V} h(y,x) + S(x) - T(x) \\
& h(x,y) \ge 0
\end{aligned}
\]

It is clear that in the first constraint we must have equality in any optimal solution. Otherwise we can
improve the objective value by decreasing some variable $h(x,y)$ without violating any constraints.

Since $d$ is a metric we can now apply Lemma 3.2 to conclude that $\beta(S,T) = d_{\mathrm{EM}}(S,T)$ and thus
$\mathrm{bias}(S,T) = d_{\mathrm{EM}}(S,T)$. □
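As a sanity check on Theorem 3.3, the following minimal sketch works through a three-point instance on the real line. The instance is our own assumption, not part of the analysis: the points 0, 0.5, 1 with $d(x,y) = |x - y| \le 1$ and two made-up distributions S and T. It also relies on the standard fact that, for points on a line with this cost, the Earthmover distance equals the integrated absolute difference of the two cumulative distribution functions. The mapping $\mu_x(0) = 1 - x$ is Lipschitz, and its bias coincides with the Earthmover distance, consistent with (9) and (10) holding with equality when $d \le 1$.

```python
# Hypothetical three-point instance; all names and numbers are illustrative.
points = [0.0, 0.5, 1.0]
S = [0.6, 0.3, 0.1]
T = [0.1, 0.3, 0.6]


def earthmover_1d(xs, p, q):
    """Earthmover distance on the line with cost |x - y|, via the CDF formula."""
    total, cdf_p, cdf_q = 0.0, 0.0, 0.0
    for i in range(len(xs) - 1):
        cdf_p += p[i]
        cdf_q += q[i]
        total += abs(cdf_p - cdf_q) * (xs[i + 1] - xs[i])
    return total


def bias_of(mu0, p, q):
    """Difference in the probability of outcome 0 between groups p and q."""
    return sum(pi * m for pi, m in zip(p, mu0)) - sum(qi * m for qi, m in zip(q, mu0))


# Candidate mapping: the probability of outcome 0 decreases linearly in x.
mu0 = [1.0 - x for x in points]

# Lipschitz check: |mu_x(0) - mu_y(0)| <= d(x, y) for all pairs.
lipschitz = all(
    abs(mu0[i] - mu0[j]) <= abs(points[i] - points[j]) + 1e-12
    for i in range(len(points))
    for j in range(len(points))
)

d_em = earthmover_1d(points, S, T)
bias = bias_of(mu0, S, T)
print(lipschitz, round(d_em, 6), round(bias, 6))
```

The CDF shortcut is specific to the line; for a general metric one would solve the linear program of Lemma 3.2 instead.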

Remark 3.1. Here we point out a different proof of the fact that $\mathrm{bias}_{D_{\mathrm{tv}},d}(S,T) \le d_{\mathrm{EM}}(S,T)$ which
does not involve LP duality. Indeed $d_{\mathrm{EM}}(S,T)$ can be interpreted as giving the cost of the best coupling
between the two distributions S and T subject to the penalty function $d(x,y)$. Recall, a coupling is a
distribution $(X,Y)$ over $V \times V$ such that the marginal distributions are S and T, respectively. The cost
of the coupling is $\mathbb{E}\, d(X,Y)$. It is not difficult to argue directly that any such coupling gives an upper
bound on $\mathrm{bias}_{D_{\mathrm{tv}},d}(S,T)$. We chose the linear programming proof since it leads to additional insight
into the tightness of the theorem.

The situation for $\mathrm{bias}_{D_\infty,d}$ is somewhat more complicated and we do not get a tight characterization
in terms of an Earthmover distance. We do however have the following upper bound.
Lemma 3.4.

\[ \mathrm{bias}_{D_\infty,d}(S,T) \le \mathrm{bias}_{D_{\mathrm{tv}},d}(S,T) \tag{11} \]

Proof. By Lemma 2.3, we have $D_{\mathrm{tv}}(\mu_x,\mu_y) \le D_\infty(\mu_x,\mu_y)$ for any two distributions $\mu_x, \mu_y$. Hence,
every $(D_\infty,d)$-Lipschitz mapping is also $(D_{\mathrm{tv}},d)$-Lipschitz. Therefore, $\mathrm{bias}_{D_{\mathrm{tv}},d}(S,T)$ is a relaxation
of $\mathrm{bias}_{D_\infty,d}(S,T)$. □
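The inequality from Lemma 2.3 used in this proof can be checked numerically. The sketch below assumes the definitions $D_{\mathrm{tv}}(P,Q) = \frac{1}{2}\sum_a |P(a) - Q(a)|$ and $D_\infty(P,Q) = \max_a |\ln(P(a)/Q(a))|$ (our reading of the earlier definitions) and uses made-up distributions with full support.

```python
import math


def d_tv(p, q):
    """Total variation distance between two distributions on the same outcomes."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def d_inf(p, q):
    """Largest absolute log-ratio; assumes both distributions have full support."""
    return max(abs(math.log(pi / qi)) for pi, qi in zip(p, q))


# Illustrative distributions over three outcomes (not taken from the paper).
pairs = [
    ([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]),
    ([0.7, 0.2, 0.1], [0.1, 0.2, 0.7]),
    ([0.34, 0.33, 0.33], [0.33, 0.33, 0.34]),
]
checks = [(d_tv(p, q), d_inf(p, q)) for p, q in pairs]
for tv, inf in checks:
    print(f"D_tv = {tv:.4f}  D_inf = {inf:.4f}  holds: {tv <= inf}")
```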

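Corollary 3.5.

\[ \mathrm{bias}_{D_\infty,d}(S,T) \le d_{\mathrm{EM}}(S,T) \tag{12} \]

For completeness we note the dual linear program obtained from the definition of $\mathrm{bias}_{D_\infty,d}(S,T)$:

\[
\begin{aligned}
\mathrm{bias}_{D_\infty,d}(S,T) = \min\ & \sum_{x \in V} \epsilon_x \\
\text{subject to}\ & \sum_{y \in V} f(x,y) + \epsilon_x \ge \sum_{y \in V} f(y,x)\, e^{d(x,y)} + S(x) - T(x) && (13) \\
& \sum_{y \in V} g(x,y) + \epsilon_x \ge \sum_{y \in V} g(y,x)\, e^{d(x,y)} && (14) \\
& f(x,y),\ g(x,y) \ge 0
\end{aligned}
\]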

Similar to the proof of Theorem 3.3, we may interpret this program as a flow problem. The variables
$f(x,y), g(x,y)$ represent a nonnegative flow from $x$ to $y$ and $\epsilon_x$ are slack variables. Note that the
variables $\epsilon_x$ are unrestricted as they correspond to an equality constraint. The first constraint requires
that $x$ has at least $S(x) - T(x)$ outgoing units of flow in $f$. The RHS of the constraints states that the
penalty for receiving a unit of flow from $y$ is $e^{d(x,y)}$. However, it is no longer clear that we can get rid
of the variables $\epsilon_x, g(x,y)$.

Open Question 3.1. Can we achieve a tight characterization of when $(D_\infty,d)$-Lipschitz implies
statistical parity?



4 Fair Affirmative Action 

In this section, we explore how to implement what may be called fair affirmative action. Indeed, a
typical question when we discuss fairness is, "What if we want to ensure statistical parity between two
groups S and T, but members of S are less likely to be 'qualified'?" In Section 3, we have seen that
when S and T are "similar" then the Lipschitz condition implies statistical parity. Here we consider
the complementary case where S and T are very different and imposing statistical parity corresponds
to preferential treatment. This is a cardinal question, which we examine with a concrete example
illustrated in Figure 2.

For simplicity, let $T = S^c$. Assume $|S|/|T \cup S| = 1/10$, so S is only 10% of the population. Suppose
that our task-specific metric partitions $S \cup T$ into two groups, call them $G_0$ and $G_1$, where members of
$G_i$ are very close to one another and very far from all members of $G_{1-i}$. Let $S_i$, respectively $T_i$, denote
the intersection $S \cap G_i$, respectively $T \cap G_i$, for $i = 0, 1$. Finally, assume $|S_0| = |T_0| = 9|S|/10$. Thus,
$G_0$ contains less than 20% of the total population, and is equally divided between S and T.

Figure 2: $S_i = G_i \cap S$, $T_i = G_i \cap T$, for the two groups $G_0$ and $G_1$.

The Lipschitz condition requires that members of each $G_i$ be treated similarly to one another, but
there is no requirement that members of $G_0$ be treated similarly to members of $G_1$. The treatment of
members of S, on average, may therefore be very different from the treatment, on average, of members
of T, since members of S are over-represented in $G_0$ and under-represented in $G_1$. Thus the Lipschitz
condition says nothing about statistical parity in this case.
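To make the gap concrete, the short calculation below instantiates the example with an assumed population of 1000 (the head count is ours; only the ratios come from the text). It supposes, purely for illustration, that every member of $G_1$ and no member of $G_0$ receives some favorable outcome, and compares the resulting rates for S and T.

```python
# Illustrative head count; only the ratios come from the example above.
N = 1000
size_S = N // 10            # S is 10% of the population
size_T = N - size_S         # T is the complement of S
S0 = T0 = 9 * size_S // 10  # |S_0| = |T_0| = 9|S|/10
S1 = size_S - S0
T1 = size_T - T0

share_G0 = (S0 + T0) / N
# Hypothetical rule: everyone in G_1, and nobody in G_0, gets the favorable outcome.
rate_S = S1 / size_S
rate_T = T1 / size_T
print(share_G0, rate_S, rate_T)
```

Under this rule each group $G_i$ is treated uniformly, so the Lipschitz condition within each group is satisfied, yet the two rates differ by a wide margin.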

Suppose the members of $G_i$ are to be shown an advertisement $\mathrm{ad}_i$ for a loan offering, where the
terms in $\mathrm{ad}_1$ are superior to those in $\mathrm{ad}_0$. Suppose further that the distance metric has partitioned the
population according to (something correlated with) credit score, with those in $G_1$ having higher scores
than those in $G_0$.

On the one hand, this seems fair: people with better ability to repay are being shown a more
attractive product. Now we ask two questions: "What is the effect of imposing statistical parity?" and
"What is the effect of failing to impose statistical parity?"

Imposing Statistical Parity. Essentially all of S is in $G_0$, so for simplicity let us suppose that indeed
$S_0 = S \subset G_0$. In this case, to ensure that members of S have comparable chance of seeing $\mathrm{ad}_1$ as do
members of T, members of S must be treated, for the most part, like those in $T_1$. In addition, by the
Lipschitz condition, members of $T_0$ must be treated like members of $S_0 = S$, so these, also, are treated
like $T_1$, and the space essentially collapses, leaving only trivial solutions such as assigning a fixed
probability distribution on the advertisements $(\mathrm{ad}_0, \mathrm{ad}_1)$ and showing ads according to this distribution
to each individual, or showing all individuals $\mathrm{ad}_i$ for some fixed $i$. However, while fair (all individuals
are treated identically), these solutions fail to take the vendor's loss function into account.

Failing to Impose Statistical Parity. The demographics of the groups $G_i$ differ from the
demographics of the general population. Even though half the individuals shown $\mathrm{ad}_0$ are members of S
and half are members of T, this in turn can cause a problem with fairness: an "anti-S" vendor can
effectively eliminate most members of S by replacing the "reasonable" advertisement $\mathrm{ad}_0$ offering
less good terms, with a blatantly hostile message designed to drive away customers. This eliminates
essentially all business with members of S, while keeping intact most business with members of T.
Thus, if members of S are relatively far from the members of T according to the distance metric, then
satisfying the Lipschitz condition may fail to prevent some of the unfair practices.
4.1 An alternative optimization problem

With the above discussion in mind, we now suggest a different approach, in which we insist on
statistical parity, but we relax the Lipschitz condition between elements of S and elements of $S^c$. This
is consistent with the essence of preferential treatment, which implies that elements in S are treated
differently than elements in T. The approach is inspired by the use of the Earthmover relaxation in the
context of metric labeling and 0-extension [KT02, CKNZ04]. Relaxing the $S \times T$ Lipschitz constraints
also makes sense if the information about the distances between members of S and members of T is of
lower quality, or less reliable, than the internal distance information within these two sets.
We proceed in two steps:

1. (a) First we compute a mapping from elements in S to distributions over T which transports
the uniform distribution over S to the uniform distribution over T, while minimizing
the total distance traveled. Additionally the mapping preserves the Lipschitz condition
between elements within S.

(b) This mapping gives us the following new loss function for elements of T: For $y \in T$ and
$a \in A$ we define a new loss, $L'(y,a)$, as

\[ L'(y,a) = \sum_{x \in S} \mu_x(y)\, L(x,a) + L(y,a), \]

where $\{\mu_x\}_{x \in S}$ denotes the mapping computed in step (a). $L'$ can be viewed as a reweighting
of the loss function $L$, taking into account the loss on S (indirectly through its mapping to
T).

2. Run the Fairness LP only on T, using the new loss function $L'$.

Composing these two steps yields a mapping from $V = S \cup T$ into $A$.
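The reweighting in step 1(b) is a direct computation once the transport from step 1(a) is available. The following minimal sketch assumes a toy instance of our own making: two elements in S, two in T, two outcomes, a made-up loss table, and a hand-written transport `mu` standing in for the output of step 1(a). None of these values come from the analysis above.

```python
# Hypothetical toy instance: two elements of S, two of T, two outcomes.
S = ["s1", "s2"]
T = ["t1", "t2"]
A = ["a0", "a1"]

# Vendor's original loss L(x, a); the numbers are made up for illustration.
L = {
    ("s1", "a0"): 0.2, ("s1", "a1"): 0.8,
    ("s2", "a0"): 0.6, ("s2", "a1"): 0.4,
    ("t1", "a0"): 0.1, ("t1", "a1"): 0.9,
    ("t2", "a0"): 0.7, ("t2", "a1"): 0.3,
}

# Stand-in for step 1(a): mu[x][y] is the mass that x in S places on y in T.
mu = {
    "s1": {"t1": 0.75, "t2": 0.25},
    "s2": {"t1": 0.25, "t2": 0.75},
}


def reweighted_loss(L, mu, S, T, A):
    """L'(y, a) = sum over x in S of mu_x(y) * L(x, a), plus L(y, a)."""
    return {
        (y, a): sum(mu[x][y] * L[(x, a)] for x in S) + L[(y, a)]
        for y in T
        for a in A
    }


L_prime = reweighted_loss(L, mu, S, T, A)
for key in sorted(L_prime):
    print(key, round(L_prime[key], 4))
```

The resulting table `L_prime` is what step 2 would pass to the Fairness LP restricted to T.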

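Formally, we can express the first step of this alternative approach as a restricted Earthmover problem
defined as

\[
\begin{aligned}
d_{\mathrm{EM+L}}(S,T) \overset{\mathrm{def}}{=} \min\ & \mathbb{E}_{x \in S}\, \mathbb{E}_{y \sim \mu_x}\, d(x,y) \\
\text{subject to}\ & D(\mu_x, \mu_{x'}) \le d(x,x') \quad \text{for all } x, x' \in S \\
& D_{\mathrm{tv}}(\mu_S, U_T) \le \epsilon \\
& \mu_x \in \Delta(T) \quad \text{for all } x \in S
\end{aligned}
\tag{15}
\]

Here, $U_T$ denotes the uniform distribution over T. Given $\{\mu_x\}_{x \in S}$ which minimizes (15) and $\{\nu_x\}_{x \in T}$
which minimizes the original fairness LP (2) restricted to T, we define the mapping $M\colon V \to \Delta(A)$ by
putting

\[
M(x) = \begin{cases} \nu_x & x \in T \\ \mathbb{E}_{y \sim \mu_x} \nu_y & x \in S \end{cases}
\tag{16}
\]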
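The composition in (16) can be sketched in a few lines. The example below is a minimal illustration under assumed inputs: `nu` stands in for a mapping already computed on T, and `mu` is a hand-written transport that moves the uniform distribution on S exactly onto the uniform distribution on T. Neither is obtained by actually solving (15) or the fairness LP. In that exact-transport case the average outcome distributions of S and T coincide.

```python
def compose_mapping(mu, nu):
    """Build M: M(x) = nu_x on T, and the mu_x-average of nu_y on S."""
    outcomes = sorted({a for dist in nu.values() for a in dist})
    M = {y: dict(dist) for y, dist in nu.items()}
    for x, weights in mu.items():
        M[x] = {a: sum(w * nu[y][a] for y, w in weights.items()) for a in outcomes}
    return M


def group_average(M, group):
    """Average outcome distribution over a group (uniform weights)."""
    outcomes = sorted(next(iter(M.values())))
    return {a: sum(M[x][a] for x in group) / len(group) for a in outcomes}


def d_tv(p, q):
    """Total variation distance between two outcome distributions."""
    return 0.5 * sum(abs(p[a] - q[a]) for a in p)


# Hypothetical inputs, chosen by hand for illustration.
nu = {"t1": {"a0": 0.9, "a1": 0.1}, "t2": {"a0": 0.3, "a1": 0.7}}
mu = {"s1": {"t1": 0.75, "t2": 0.25}, "s2": {"t1": 0.25, "t2": 0.75}}

M = compose_mapping(mu, nu)
gap = d_tv(group_average(M, ["s1", "s2"]), group_average(M, ["t1", "t2"]))
print(M["s1"], gap)
```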

Before stating properties of the mapping M we make some remarks. 

1. Fundamentally, this new approach shifts from minimizing loss, subject to the Lipschitz
constraints, to minimizing loss and disruption of the $S \times T$ Lipschitz requirement, subject to the parity
and $S \times S$ and $T \times T$ Lipschitz constraints. This gives us a bicriteria optimization problem, with
a wide range of options.

2. We also have some flexibility even in the current version. For example, we can eliminate the
re-weighting, prohibiting the vendor from expressing any opinion about the fate of elements
in S. This makes sense in several settings. For example, the vendor may request this due to
ignorance (e.g., lack of market research) about S, or the vendor may have some (hypothetical)
special legal status based on past discrimination against S.

3. It is instructive to compare the alternative approach to a modification of the Fairness LP in which
we enforce statistical parity and eliminate the Lipschitz requirement on $S \times T$. The alternative
approach is more faithful to the $S \times T$ distances, providing protection against the self-fulfilling
prophecy discussed in the Introduction, in which the vendor deliberately selects the "wrong"
subset of S while still maintaining statistical parity.

4. A related approach to addressing preferential treatment involves adjusting the metric in such a
way that the Lipschitz condition will imply statistical parity. This coincides with at least one
philosophy behind affirmative action: that the metric does not fully reflect potential that may
be undeveloped because of unequal access to resources. Therefore, when we consider one of
the strongest individuals in S, affirmative action suggests it is more appropriate to consider this
individual as similar to one of the strongest individuals of T (rather than to an individual of T
which is close according to the original distance metric). In this case, it is natural to adjust the
distances between elements in S and T rather than inside each one of the populations (other than
possibly re-scaling). This gives rise to a family of optimization problems:

Find a new distance metric $d'$ which "best approximates" $d$ under the condition that
S and T have small Earthmover distance under $d'$,

where we have the flexibility of choosing the measure of how well $d'$ approximates $d$.

Let $M$ be the mapping of Equation 16. The following properties of $M$ are easy to verify.

Proposition 4.1. The mapping $M$ defined in (16) satisfies

1. statistical parity between S and T up to bias $\epsilon$,

2. the Lipschitz condition for every pair $(x,y) \in (S \times S) \cup (T \times T)$.

Proof. The first property follows since

\[ D_{\mathrm{tv}}(M(S), M(T)) = D_{\mathrm{tv}}\Big( \mathbb{E}_{x \in S}\, \mathbb{E}_{y \sim \mu_x} \nu_y,\ \mathbb{E}_{x \in T}\, \nu_x \Big) \le D_{\mathrm{tv}}(\mu_S, U_T) \le \epsilon. \]

The second claim is trivial for $(x,y) \in T \times T$. So, let $(x,y) \in S \times S$. Then,

\[ D(M(x), M(y)) \le D(\mu_x, \mu_y) \le d(x,y). \]

□

We have given up the Lipschitz condition between S and T, instead relying on the terms $d(x,y)$ in
the objective function to discourage mapping $x$ to distant $y$'s. It turns out that the Lipschitz condition
between elements $x \in S$ and $y \in T$ is still maintained on average and that the expected violation is
given by $d_{\mathrm{EM+L}}(S,T)$ as shown next.
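Proposition 4.2. Suppose $D = D_{\mathrm{tv}}$ in (15). Then, the resulting mapping $M$ satisfies

\[ \mathbb{E}_{x \in S}\, \max_{y \in T} \big[ D_{\mathrm{tv}}(M(x), M(y)) - d(x,y) \big] \le d_{\mathrm{EM+L}}(S,T). \]

Proof. For every $x \in S$ and $y \in T$ we have

\[
\begin{aligned}
D_{\mathrm{tv}}(M(x), M(y)) &= D_{\mathrm{tv}}\Big( \mathbb{E}_{z \sim \mu_x} M(z),\ M(y) \Big) \\
&\le \mathbb{E}_{z \sim \mu_x} D_{\mathrm{tv}}(M(z), M(y)) && \text{(by Fact 2.1)} \\
&\le \mathbb{E}_{z \sim \mu_x} d(z,y) && \text{(Proposition 4.1 since } z, y \in T) \\
&\le d(x,y) + \mathbb{E}_{z \sim \mu_x} d(x,z) && \text{(by triangle inequalities)}
\end{aligned}
\]

The proof is completed by taking the expectation over $x \in S$.

□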
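The inequality chain in this proof can be checked on a small instance. The sketch below is illustrative only and rests on several assumptions of ours: points on the real line with $d(x,y) = |x - y|$, binary outcomes, a mapping on T that is Lipschitz by construction, and a hand-picked transport `mu` that is not obtained by solving (15). Because `mu` is not the minimizer, the comparison is against the transport cost of this particular `mu` rather than against $d_{\mathrm{EM+L}}(S,T)$ itself.

```python
# Hypothetical instance on the real line with d(x, y) = |x - y|.
pos = {"t1": 0.0, "t2": 0.5, "t3": 1.0, "s1": 0.2, "s2": 0.9}
T = ["t1", "t2", "t3"]
S = ["s1", "s2"]


def d(x, y):
    return abs(pos[x] - pos[y])


# Binary outcomes: p[y] is the probability of outcome 0 under M(y).
# On T this choice is Lipschitz, since |p[y] - p[y']| = d(y, y').
p = {y: 1.0 - pos[y] for y in T}

# Hand-picked transport from S to distributions over T (an assumption;
# it is not obtained by solving the optimization problem (15)).
mu = {"s1": {"t1": 0.5, "t2": 0.5}, "s2": {"t3": 1.0}}

# M on S averages the T-mapping under mu_x, as in (16).
for x in S:
    p[x] = sum(w * p[z] for z, w in mu[x].items())

# For binary outcomes, D_tv(M(x), M(y)) = |p[x] - p[y]|.
violation = sum(max(abs(p[x] - p[y]) - d(x, y) for y in T) for x in S) / len(S)
transport_cost = sum(
    sum(w * d(x, z) for z, w in mu[x].items()) for x in S
) / len(S)
print(round(violation, 4), round(transport_cost, 4))
```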



An interesting challenge for future work is handling preferential treatment of multiple protected 
subsets that are not mutually disjoint. The case of disjoint subsets seems easier and in particular 
amenable to our approach. 

5 Small loss in bounded doubling dimension 

The general LP shows that given an instance I, it is possible to find an "optimally fair" mapping in 
polynomial time. The result however does not give a concrete quantitative bound on the resulting loss. 
Further, when the instance is very large, it is desirable to come up with more efficient methods to 
define the mapping. 

We now give a fairness mechanism for which we can prove a bound on the loss that it achieves 
in a natural setting. Moreover, the mechanism is significantly more efficient than the general linear 
program. Our mechanism is based on the exponential mechanism [MT07], first considered in the 
context of differential privacy. 

We will describe the method in the natural setting where the mapping M maps elements of V to 
distributions over V itself. The method could be generalized to a different set A as long as we also have 
a distance function defined over A and some distance preserving embedding of V into A. A natural 
loss function to minimize in the setting where V is mapped into distributions over V is given by the 
metric d itself. In this setting we will give an explicit Lipschitz mapping and show that under natural 
assumptions on the metric space (V, d) the mapping achieves small loss. 

Definition 5.1. Given a metric $d\colon V \times V \to \mathbb{R}$ the exponential mechanism $E\colon V \to \Delta(V)$ is defined by
putting

\[ E(x) \overset{\mathrm{def}}{=} \Big[ Z_x^{-1} e^{-d(x,y)} \Big]_{y \in V}, \]

where $Z_x = \sum_{y \in V} e^{-d(x,y)}$.

Lemma 5.1 ([MT07]). The exponential mechanism is $(D_\infty, d)$-Lipschitz.

One cannot in general expect the exponential mechanism to achieve small loss. However, this
turns out to be true in the case where $(V,d)$ has small doubling dimension. It is important to note that
in differential privacy, the space of databases does not have small doubling dimension. The situation
in fairness is quite different. Many metric spaces arising in machine learning applications do have
bounded doubling dimension. Hence the theorem that we are about to prove applies in many natural
settings.
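The exponential mechanism of Definition 5.1 is straightforward to compute on a finite metric space. The sketch below is a minimal illustration on an assumed instance of our own choosing, integer points on a line, which are well separated and have small doubling dimension. It reports the average loss for two sizes of the space. It also compares the largest log-ratio between $E(x)$ and $E(y)$ with the conservative bound $2\,d(x,y)$, which follows from a direct calculation on the formula (one factor from the numerator and one from the normalizer $Z_x$); the function name `max_log_ratio` and this choice of bound are ours.

```python
import math


def exponential_mechanism(points, d):
    """E(x)(y) = exp(-d(x, y)) / Z_x with Z_x = sum over y of exp(-d(x, y))."""
    E = {}
    for x in points:
        weights = {y: math.exp(-d(x, y)) for y in points}
        Z_x = sum(weights.values())
        E[x] = {y: w / Z_x for y, w in weights.items()}
    return E


def average_loss(points, d, E):
    """Average over x of the expected distance from x to a sample y ~ E(x)."""
    return sum(
        sum(prob * d(x, y) for y, prob in E[x].items()) for x in points
    ) / len(points)


def max_log_ratio(P, Q):
    """Largest absolute log-ratio between two full-support distributions."""
    return max(abs(math.log(P[y] / Q[y])) for y in P)


def line_metric(x, y):
    return abs(x - y)


# Integer points on a line: well separated, with small doubling dimension.
losses = {}
for n in (50, 200):
    pts = list(range(n))
    E = exponential_mechanism(pts, line_metric)
    losses[n] = average_loss(pts, line_metric, E)

small = list(range(6))
E_small = exponential_mechanism(small, line_metric)
ratio_ok = all(
    max_log_ratio(E_small[x], E_small[y]) <= 2 * line_metric(x, y) + 1e-9
    for x in small
    for y in small
)
print(losses, ratio_ok)
```

On this instance the average loss stays below 1 and barely changes as the space grows from 50 to 200 points.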

Definition 5.2. The doubling dimension of a metric space $(V,d)$ is the smallest number $k$ such that for
every $x \in V$ and every $R > 0$ the ball of radius $R$ around $x$, denoted $B(x,R) = \{y \in V : d(x,y) \le R\}$, can
be covered by $2^k$ balls of radius $R/2$.

We will also need that points in the metric space are not too close together.

Definition 5.3. We call a metric space $(V,d)$ well separated if there is a positive constant $\epsilon > 0$ such
that $|B(x,\epsilon)| = 1$ for all $x \in V$.

Theorem 5.2. Let $(V,d)$ be a well separated metric space of bounded doubling dimension. Then the
exponential mechanism satisfies

\[ \mathbb{E}_{x \in V}\, \mathbb{E}_{y \sim E(x)}\, d(x,y) = O(1). \]

Proof. Suppose $d$ has doubling dimension $k$. It was shown in [CG08] that doubling dimension $k$
implies for every $R > 0$ that

\[ \mathbb{E}_{x \in V} |B(x,2R)| \le 2^{k'}\, \mathbb{E}_{x \in V} |B(x,R)|, \tag{17} \]

where $k' = O(k)$. It follows from this condition and the assumption on $(V,d)$ that for some positive
$\epsilon > 0$,

\[ \mathbb{E}_{x \in V} |B(x,1)| \le \Big(\frac{1}{\epsilon}\Big)^{k'} \mathbb{E}_{x \in V} |B(x,\epsilon)| = 2^{O(k)}. \tag{18} \]



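Then,

\[
\begin{aligned}
\mathbb{E}_{x \in V}\, \mathbb{E}_{y \sim E(x)}\, d(x,y)
&\le 1 + \mathbb{E}_{x \in V} \int_1^\infty \frac{r e^{-r}}{Z_x}\, |B(x,r)|\, dr \\
&\le 1 + \mathbb{E}_{x \in V} \int_1^\infty r e^{-r}\, |B(x,r)|\, dr && (\text{since } Z_x \ge e^{-d(x,x)} = 1) \\
&= 1 + \int_1^\infty r e^{-r}\, \mathbb{E}_{x \in V} |B(x,r)|\, dr \\
&\le 1 + \int_1^\infty r e^{-r}\, r^{k'}\, \mathbb{E}_{x \in V} |B(x,1)|\, dr && (\text{using (17)}) \\
&\le 1 + 2^{O(k)} \int_0^\infty r^{k'+1} e^{-r}\, dr && (\text{using (18)}) \\
&\le 1 + 2^{O(k)} (k'+2)!.
\end{aligned}
\]

As we assumed that $k = O(1)$, we conclude

\[ \mathbb{E}_{x \in V}\, \mathbb{E}_{y \sim E(x)}\, d(x,y) \le 2^{O(k)} (k'+2)! \le O(1). \]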



Remark 5.1. If $(V,d)$ is not well-separated, then for every constant $\epsilon > 0$, it must contain a
well-separated subset $V' \subseteq V$ such that every point $x \in V$ has a neighbor $x' \in V'$ such that $d(x,x') \le \epsilon$.
A Lipschitz mapping $M'$ defined on $V'$ naturally extends to all of $V$ by putting $M(x) = M'(x')$
where $x'$ is the nearest neighbor of $x$ in $V'$. It is easy to see that the expected loss of $M$ is only an
additive $\epsilon$ worse than that of $M'$. Similarly, the Lipschitz condition deteriorates by an additive $2\epsilon$,
i.e., $D_\infty(M(x), M(y)) \le d(x,y) + 2\epsilon$. Indeed, denoting the nearest neighbors in $V'$ of $x, y$ by $x', y'$
respectively, we have $D_\infty(M(x), M(y)) = D_\infty(M'(x'), M'(y')) \le d(x',y') \le d(x,y) + d(x,x') + d(y,y') \le
d(x,y) + 2\epsilon$. Here, we used the triangle inequality.
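The extension described in Remark 5.1 can be sketched as follows. The greedy construction of the subset $V'$ is one possible choice and is our assumption (the remark only asserts that such a subset exists); the point set, the value of `eps`, and the placeholder labels standing in for the distributions $M'(x')$ are likewise illustrative.

```python
def greedy_net(points, d, eps):
    """Pick a subset whose members are more than eps apart and eps-cover all points."""
    net = []
    for x in points:
        if all(d(x, c) > eps for c in net):
            net.append(x)
    return net


def extend_mapping(points, net, d, M_prime):
    """M(x) = M'(x') where x' is the nearest neighbor of x in the net."""
    return {x: M_prime[min(net, key=lambda c: d(x, c))] for x in points}


def line_metric(x, y):
    return abs(x - y)


# Hypothetical data: clustered points on the real line.
V = [0.0, 0.05, 0.1, 1.0, 1.02, 2.0, 2.08, 2.1]
eps = 0.2
V_prime = greedy_net(V, line_metric, eps)

# Placeholder mapping on the net: each net point is mapped to a label
# standing in for a distribution M'(x').
M_prime = {c: f"dist_{i}" for i, c in enumerate(V_prime)}
M = extend_mapping(V, V_prime, line_metric, M_prime)

covered = all(min(line_metric(x, c) for c in V_prime) <= eps for x in V)
print(V_prime, covered)
```

Since the selected points are pairwise more than `eps` apart, the subset is well separated in the sense of Definition 5.3, and every point of `V` is within `eps` of its representative.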

The proof of Theorem 5.2 shows an exponential dependence on the doubling dimension $k$ of the
underlying space in the error of the exponential mechanism. The next theorem shows that the loss of
any Lipschitz mapping has to scale at least linearly with $k$. The proof follows from a packing argument
similar to that in [HT10]. The argument is slightly complicated by the fact that we need to give a lower
bound on the average error (over $x \in V$) of any mechanism.

Definition 5.4. A set $B \subseteq V$ is called an $R$-packing if $d(x,y) > R$ for all distinct $x, y \in B$.
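Definition 5.4 translates directly into a pairwise check. The sketch below is illustrative only: the point set and the Euclidean metric are assumptions chosen for the example, and the check ranges over pairs of distinct points.

```python
from itertools import combinations
import math


def euclidean(p, q):
    return math.dist(p, q)


def is_packing(B, d, R):
    """True if every two distinct points of B are more than R apart."""
    return all(d(x, y) > R for x, y in combinations(B, 2))


# Corners of the unit square: side length 1, diagonal sqrt(2).
corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(is_packing(corners, euclidean, 0.9), is_packing(corners, euclidean, 1.0))
```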

Here we give a lower bound using a metric space that may not be well-separated. However,
following Remark 5.1, this also shows that any mapping defined on a well-separated subset of the
metric space must have large error up to a small additive loss.

Theorem 5.3. For every $k \ge 2$ and every large enough $n \ge n_0(k)$ there exists an $n$-point metric space
of doubling dimension $O(k)$ such that any $(D_\infty, d)$-Lipschitz mapping $M\colon V \to \Delta(V)$ must satisfy

\[ \mathbb{E}_{x \in V}\, \mathbb{E}_{y \sim M(x)}\, d(x,y) \ge \Omega(k). \]

Proof. Construct $V$ by randomly picking $n$ points from an $r$-dimensional sphere of radius $100k$. We will
choose $n$ sufficiently large and $r = O(k)$. Endow $V$ with the Euclidean distance $d$. Since $V \subseteq \mathbb{R}^r$ and
$r = O(k)$ it follows from a well-known fact that the doubling dimension of $(V,d)$ is bounded by $O(k)$.

Claim 5.4. Let $X$ be the distribution obtained by choosing a random $x \in V$ and outputting a random
$y \in B(x,k)$. Then, for sufficiently large $n$, the distribution $X$ has statistical distance at most $1/100$ from
the uniform distribution over $V$.

Proof. The claim follows from standard arguments showing that for large enough $n$ every point $y \in V$
is contained in approximately equally many balls of radius $k$. □

Let $M$ denote any $(D_\infty, d)$-Lipschitz mapping and denote its error on a point $x \in V$ by

\[ R(x) = \mathbb{E}_{y \sim M(x)}\, d(x,y), \]

and put $R = \mathbb{E}_{x \in V} R(x)$. Let $G = \{x \in V : R(x) \le 2R\}$. By Markov's inequality $|G| \ge n/2$.

Now, pick $x \in V$ uniformly at random and choose a set $P_x$ of $2^{2k}$ random points (with replacement)
from $B(x,k)$. For sufficiently large dimension $r = O(k)$, it follows from concentration of measure on
the sphere that $P_x$ forms a $k/2$-packing with probability, say, $1/10$.

Moreover, by Claim 5.4, for random $x \in V$ and random $y \in B(x,k)$, the probability that $y \in G$ is at
least $|G|/|V| - 1/100 \ge 1/3$. Hence, with high probability,

\[ |P_x \cap G| \ge 2^{2k}/10. \tag{19} \]

Now, suppose $M$ satisfies $R \le k/100$. We will lead this to a contradiction thus showing that $M$
has average error at least $k/100$. Indeed, under the assumption that $R \le k/100$, we have that for every
$y \in G$,

\[ \Pr\{M(y) \in B(y, k/50)\} \ge \frac{1}{2}, \tag{20} \]

and therefore

\[
1 \ge \Pr\Big\{ M(x) \in \bigcup_{y \in P_x \cap G} B(y, k/2) \Big\}
\ge \sum_{y \in P_x \cap G} \exp(-k)\, \Pr\{M(y) \in B(y, k/2)\}
\ge \frac{2^{2k}}{10} \cdot \frac{\exp(-k)}{2}.
\]

This is a contradiction which shows that $R > k/100$.

□



Open Question 5.1. Can we improve the exponential dependence on the doubling dimension in our 
upper bound? 

6 Discussion and Future Directions 

In this paper we introduced a framework for characterizing fairness in classification. The key element 
in this framework is a requirement that similar people be treated similarly in the classification. We 
developed an optimization approach which balanced these similarity constraints with a vendor's loss 
function, and analyzed when this local fairness condition implies statistical parity, a strong notion 
of equal treatment. We also presented an alternative formulation enforcing statistical parity, which 
is especially useful to allow preferential treatment of individuals from some group. We remark that 
although we have focused on using the metric as a method of defining and enforcing fairness, one can 
also use our approach to certify fairness (or to detect unfairness). This permits us to evaluate classifiers
even when fairness is defined based on data that simply isn't available to the classification algorithm.[3]
Below we consider some open questions and directions for future work. 

6.1 On the Similarity Metric 

As noted above, one of the most challenging aspects of our work is justifying the availability of 
a distance metric. We argue here that the notion of a metric already exists in many classification 
problems, and we consider some approaches to building such a metric. 

6.1.1 Defining a metric on individuals

The imposition of a metric already occurs in many classification processes. Examples include credit
scores[4] for loan applications, and combinations of test scores and grades for some college admissions.
In some cases, for reasons of social engineering, metrics may be adjusted based on membership in 
various groups, for example, to increase geographic and ethnic diversity. 

The construction of a suitable metric can be partially automated using existing machine learning
techniques. This is true in particular for distances $d(x,y)$ where $x$ and $y$ are both in the same protected
set or both in the general population. When comparing individuals from different groups, we may need
human insight and domain information. This is discussed further in Section 6.1.2.

[3] This observation is due to Boaz Barak.

[4] We remark that the credit score is a one-dimensional metric that suggests an obvious interpretation as a measure of
quality rather than a measure of similarity. When the metric is defined over multiple attributes such an interpretation is no
longer clear.

Another direction, which intrigues us but which we have not yet pursued, is particularly relevant to
the context of on-line services (or advertising): allow users to specify attributes they do or do not want
to have taken into account in classifying content of interest. The risk, as noted early on in this work,
is that attributes may have redundant encodings in other attributes, including encodings of which the
user, the ad network, and the advertisers may all be unaware. Our notion of fairness can potentially
give a refinement of the "user empowerment" approach by allowing a user to participate in defining
the metric that is used when providing services to this user (one can imagine for example a menu of
metrics each one supposed to protect some subset of attributes). Further research into the feasibility of
this approach is needed. In particular, our discussion throughout this paper has assumed that a single
metric is used across the board. Can we make sense out of the idea of applying different metrics to
different users?

6.1.2 Building a metric via metric labeling 

One approach to building the metric is to first build a metric on $S^c$, say, using techniques from machine
learning, and then "inject" members of S into the metric by mapping them to members of $S^c$ in a
fashion consistent with observed information. In our case, this observed information would come from
the human insight and domain information mentioned above. Formally, this can be captured by the
problem of metric labeling [KT02]: we have a collection of $|S^c|$ labels for which a metric is defined,
together with $|S|$ objects, each of which is to be assigned a label.

It may be expensive to access this extra information needed for metric labeling. We may ask the 
question of how much information do we need in order to approximate the result we would get were 
we to have all this information. This is related to our next question. 

6.1.3 How much information is needed? 

Suppose there is an unknown metric $d^*$ (the right metric) that we are trying to find. We can ask an
expert panel to tell us $d^*(x,y)$ given $(x,y) \in V^2$. The experts are costly and we are trying to minimize
the number of calls we need to make. The question is: How many queries $q$ do we need to make to be
able to compute a metric $d\colon V \times V \to \mathbb{R}$ such that the distortion between $d$ and $d^*$ is at most $C$, i.e.,

\[ \sup_{x,y \in V} \max\left\{ \frac{d(x,y)}{d^*(x,y)},\ \frac{d^*(x,y)}{d(x,y)} \right\} \le C. \tag{21} \]

The problem can be seen as a variant of the well-studied question of constructing spanners. A
spanner is a small implicit representation of a metric $d^*$. While this is not exactly what we want, it
seems that certain spanner constructions work in our setting as well, if we are willing to relax the embedding
problem by permitting a certain fraction of the embedded edges to have arbitrary distortion, as any
finite metric can be embedded, with constant slack and constant distortion, into constant-dimensional
Euclidean space [ABC+05].

6.2 Case Study on Applications in Health Care

An interesting direction for a case study is suggested by another Wall Street Journal article (11/19/2010)
that describes the (currently experimental) practice of insurance risk assessment via online tracking.
For example, food purchases and exercise habits correlate with certain diseases. This is a stimulating,
albeit alarming, development. In the most individual-friendly interpretation described in the article,
this provides a method for assessing risk that is faster and less expensive than the current practice
of testing blood and urine samples. "Deloitte and the life insurers stress the databases wouldn't be
used to make final decisions about applicants. Rather, the process would simply speed up applications
from people who look like good risks. Other people would go through the traditional assessment
process." [SM10] Nonetheless, there are risks to the insurers, and preventing discrimination based on
protected status should therefore be of interest:

"The information sold by marketing-database firms is lightly regulated. But using it in the 
life-insurance application process would "raise questions" about whether the data would 
be subject to the federal Fair Credit Reporting Act, says Rebecca Kuehn of the Federal 
Trade Commission's division of privacy and identity protection. The law's provisions kick 
in when "adverse action" is taken against a person, such as a decision to deny insurance 
or increase rates." 

As mentioned in the introduction, the AALIM project [AAL] provides similarity information 
suitable for the health care setting. While their work is currently restricted to the area of cardiology, 
future work may extend to other medical domains. Such similarity information may be used to
assemble a metric that decides which individuals have similar medical conditions. Our framework could
then employ this metric to ensure that similar patients receive similar health care policies. This would 
help to address the concerns articulated above. We pose it as an interesting direction for future work to 
investigate how a suitable fairness metric could be extracted from the AALIM system. 

6.3 Does Fairness Hide Information? 

We have already discussed the need for hiding (non-)membership in S in ensuring fairness. We now ask 
a converse question: Does fairness in the context of advertising hide information from the advertiser? 

Statistical parity has the interesting effect that it eliminates redundant encodings of S in terms of
A, in the sense that after applying $M$, there is no $f\colon A \to \{0,1\}$ that can be biased against S in any way.
This prevents certain attacks that aim to determine membership in S.

Unfortunately, this property is not hereditary. Indeed, suppose that the advertiser wishes to target 
HIV-positive people. If the set of HIV-positive people is protected, then the advertiser is stymied by 
the statistical parity constraint. However, suppose it so happens that the advertiser's utility function is 
extremely high on people who are not only HIV-positive but who also have AIDS. Consider a mapping 
that satisfies statistical parity for "HIV-positive," but also maximizes the advertiser's utility. We expect 
that the necessary error of such a mapping will be on members of "HIV\AIDS," that is, people who 
are HIV-positive but who do not have AIDS. In particular, we don't expect the mapping to satisfy 
statistical parity for "AIDS" - the fraction of people with AIDS seeing the advertisement may be much 
higher than the fraction of people with AIDS in the population as a whole. Hence, the advertiser can in 
fact target "AIDS". 
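A small numerical illustration of this failure of heredity follows. The head counts are hypothetical and chosen only to make the arithmetic transparent: the advertisement is shown to every person with AIDS, to no other HIV-positive person, and to just enough HIV-negative people for the protected set to satisfy parity.

```python
# Hypothetical head counts chosen only to illustrate the argument.
population = 1000
hiv_positive = 100
aids = 20                      # subset of the HIV-positive group
hiv_negative = population - hiv_positive

# The advertiser shows the ad to everyone with AIDS, to no other
# HIV-positive individual, and to enough HIV-negative people for parity.
shown_aids = aids
shown_hiv_without_aids = 0
shown_hiv_negative = hiv_negative * aids // hiv_positive

rate_hiv_positive = (shown_aids + shown_hiv_without_aids) / hiv_positive
rate_overall = (shown_aids + shown_hiv_without_aids + shown_hiv_negative) / population
rate_aids = shown_aids / aids
print(rate_hiv_positive, rate_overall, rate_aids)
```

The HIV-positive group sees the advertisement at the same rate as the whole population, so parity holds for the protected set, while the rate within the AIDS subgroup is far higher.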

Alternatively, suppose people with AIDS are mapped to a region $B \subset A$, as is a |AIDS|/|HIV positive|
fraction of HIV-negative individuals. Thus, being mapped to $B$ maintains statistical parity for the set
of HIV-positive individuals, meaning that the probability that a random HIV-positive individual is
mapped to $B$ is the same as the probability that a random member of the whole population is mapped
to $B$. Assume further that mapping to $A \setminus B$ also maintains parity. Now the advertiser can refuse to do
business with all people with AIDS, sacrificing just a small amount of business in the HIV-negative
community.

These examples show that statistical parity is not a good method of hiding sensitive information in 
targeted advertising. A natural question, not yet pursued, is whether we can get better protection using 
the Lipschitz property with a suitable metric. 
A Catalog of Evils

We briefly summarize here behaviors against which we wish to protect. We make no attempt to be 
formal. Let S be a protected set. 

1. Blatant explicit discrimination. This is when membership in S is explicitly tested for and a
"worse" outcome is given to members of S than to members of $S^c$.

2. Discrimination Based on Redundant Encoding. Here the explicit test for membership in S is
replaced by a test that is, in practice, essentially equivalent. This is a successful attack against
"fairness through blindness," in which the idea is to simply ignore protected attributes such as 
sex or race. However, when personalization and advertising decisions are based on months or 
years of on-line activity, there is a very real possibility that membership in a given demographic 
group is embedded holographically in the history. Simply deleting, say, the Facebook "sex" 
and "Interested in men/women" bits almost surely does not hide homosexuality. This point was 
argued by the (somewhat informal) "Gaydar" study [JM09] in which a threshold was found for 
predicting, based on the sexual preferences of his male friends, whether or not a given male is 
interested in men. Such redundant encodings of sexual preference and other attributes need not 
be explicitly known or recognized as such, and yet can still have a discriminatory effect. 
3. Redlining. A well-known form of discrimination based on redundant encoding. The following
definition appears in an article by [Hun05], which contains the history of the term, the practice, 
and its consequences: "Redlining is the practice of arbitrarily denying or limiting financial 
services to specific neighborhoods, generally because its residents are people of color or are 
poor." 

4. Cutting off business with a segment of the population in which membership in the protected set 
is disproportionately high. A generalization of redlining, in which members of S need not be a 
majority of the redlined population; instead, the fraction of the redlined population belonging to 
S may simply exceed the fraction of S in the population as a whole. 

5. Self-fulfilling prophecy. Here the vendor advertiser is willing to cut off its nose to spite its face, 
deliberately choosing the "wrong" members of S in order to build a bad "track record" for S . A 
less malicious vendor may simply select random members of S rather than qualified members, 
thus inadvertently building a bad track record for S . 

6. Reverse tokenism. This concept arose in the context of imagining what might be a convincing
refutation to the claim "The bank denied me a loan because I am a member of S." One possible
refutation might be the exhibition of an "obviously more qualified" member of $S^c$ who is also
denied a loan. This might be compelling, but by sacrificing one really good candidate $c \in S^c$ the
bank could refute all charges of discrimination against S. That is, $c$ is a token rejectee; hence
the term "reverse tokenism" ("tokenism" usually refers to accepting a token member of S). We
remark that the general question of explaining decisions seems quite difficult, a situation only 
made worse by the existence of redundant encodings of attributes. 